Construction and validation of machine learning models for predicting distant metastases in newly diagnosed colorectal cancer patients: A large‐scale and real‐world cohort study

Abstract
Background: More accurate prediction of distant metastases (DM) in patients with colorectal cancer (CRC) would optimize individualized treatment and follow-up strategies. Multiple prediction models based on machine learning have been developed to assess the likelihood of developing DM.
Methods: Clinicopathological features of patients with CRC were obtained from the National Cancer Center (NCC, China) and the Surveillance, Epidemiology, and End Results (SEER) database. The algorithms used to create the prediction models included random forest (RF), logistic regression, extreme gradient boosting, deep neural networks, and K-Nearest Neighbor. The performance of the prediction models was evaluated using receiver operating characteristic (ROC) curves.
Results: In total, 200,958 patients were identified (3241 from the NCC and 197,717 from SEER), of whom 21,736 (10.8%) developed DM. The machine-learning-based prediction models for DM were constructed with the 12 features remaining after iterative filtering. The RF model performed the best, with areas under the ROC curve of 0.843, 0.793, and 0.806 on the training, test, and external validation sets, respectively. For the risk stratification analysis, the patients were separated into high-, middle-, and low-risk groups according to their risk scores. Patients in the high-risk group had the highest incidence of DM and the worst prognosis. Surgery, chemotherapy, and radiotherapy significantly improved the prognosis of the high-risk and middle-risk groups, whereas the low-risk group benefited only from surgery and chemotherapy.
Conclusion: The RF-based model accurately predicted the likelihood of DM and identified patients with CRC in the high-risk group, providing guidance for personalized clinical decision-making.


| INTRODUCTION
Colorectal cancer (CRC) is the third most common malignant tumor worldwide, with the number of CRC patients expected to reach approximately 2.5 million by 2035 [1,2]. Distant metastases (DM) are present in 20% of patients newly diagnosed with CRC. Although great improvements have been made in the treatment of CRC, metastatic CRC remains a fatal disease that accounts for half of all CRC-related deaths [3-6]. The most common site of CRC metastasis is the liver, followed by the lungs. Thus, chest and abdominal computed tomography (CT) is recommended for detecting hepatic and pulmonary metastases. Metastases of CRC to the bones and brain are relatively rare. The incidence of bone metastases in CRC patients varies from 6.0% to 10.4% [7,8]. However, the prognosis for bone metastases is poor: the 5-year survival rate of patients with bone metastases is <5%, and their median survival is <7 months [9,10]. Owing to the low incidence and asymptomatic nature of bone metastases, bone imaging is often neglected in clinical practice. Diagnostic imaging for bone metastasis localization is usually suggested only once patients with CRC present with symptoms of metastasis-induced bone destruction, such as skeletal-related events. Thus, CRC patients with bone metastases may miss optimal therapeutic opportunities [11]. Similar to bone metastases, brain metastases are rare events that typically occur later in the course of CRC [12-15]. Despite their low incidence, brain metastases from CRC progress aggressively and have a poor prognosis; the median survival after the diagnosis of brain metastases is <8 months [16,17]. Presently, the diagnosis of brain metastases mainly depends on neurological symptoms, and routine neurological imaging is not recommended for patients newly diagnosed with CRC [18]. Given the poor prognosis for DM in patients with CRC, a more precise model for predicting DM is needed.
As a vital aspect of artificial intelligence, machine learning has performed well in diagnosing and predicting multiple diseases and has exhibited higher accuracy than conventional methods in clinical settings [19-23]. However, no machine-learning-based model has been developed to predict the possibility of DM in patients newly diagnosed with CRC. Therefore, this study was designed to establish novel machine-learning-based models to predict the risk of CRC metastases to the liver, lung, bone, and brain using clinicopathological data from the National Cancer Center (NCC, Beijing, China) and the Surveillance, Epidemiology, and End Results (SEER) database. Such models could help clinicians promptly detect DM and select appropriate treatment strategies to improve prognosis. Furthermore, a risk stratification based on this risk prediction model was created to classify patients newly diagnosed with CRC into groups according to their risk of DM, to predict the prognosis and treatment response of patients with metastatic CRC, and to help clinicians select the optimal treatment.

| Feature engineering and data transformation
To create prediction models, feature engineering approaches were used to process the easily accessible clinicopathological data from the SEER database and the Electronic Medical Record System of the NCC. We used cross-validation (CV) and recursive feature elimination iteratively to filter variables using a random forest (RF) classifier to increase the accessibility of the prediction models. Cross-validation was used for internal validation as a reliable method to monitor the development of machine learning and enhance the performance of the models. The variables were evaluated based on their relative significance in the receiver operating characteristic (ROC) curves of the prediction models. The models were subsequently assessed and verified in the test and external validation sets. The RF, logistic regression (LR), extreme gradient boosting (XGboost), deep neural network (DNN), and K-Nearest Neighbor (KNN) models were developed by performing a 10-fold CV in the training set. Variables that were strongly associated with the probability of DM were identified using univariate and multivariate logistic regression analyses. The mutual relationships between the variables incorporated in this study were analyzed using correlation analysis. These prediction models were built and validated using the "caret" and "gbm" R packages.
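The 10-fold CV described above can be illustrated with a minimal sketch. It is written in Python with only the standard library rather than the R "caret" workflow used in the study; the function name `k_fold_indices`, the seed, and the round-robin fold assignment are illustrative assumptions, not the authors' implementation.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation.

    Hypothetical helper: indices are shuffled once, dealt into k folds
    round-robin, and each fold is held out exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one sample.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        training = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield training, validation


if __name__ == "__main__":
    for training, validation in k_fold_indices(25, k=10):
        print(len(training), len(validation))
```

In a real pipeline, a model would be fitted on each `training` subset and scored on the matching `validation` subset, and the ten scores would then be averaged.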
Our preliminary findings indicated that the performances of these machine learning methods for DM prediction were broadly similar. However, the DeLong test showed that RF tended to perform better on the training, test, and external validation sets. To further quantify the risk of DM developing in patients with CRC, we used the RF model to compute a risk score for each patient and ranked the patients from high to low. The risk scores were calculated with the RF algorithm based on age at CRC diagnosis, race, gender, year at diagnosis, tumor histology, tumor location and size, T-stage, N-stage, tumor grade, harvested lymph nodes, and primary tumor, so that every patient had an individual risk score. A higher risk score indicates a potentially higher risk of distant metastases, and vice versa. After ranking by risk score, the CRC patients were separated into three risk groups, with about 65,906 (33.3%) patients in each group. Those with the highest risk scores were assigned to the high-risk group, whereas those with the lowest risk scores were assigned to the low-risk group.
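The ranking-and-splitting step described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the helper names `tertile_sizes` and `stratify_by_risk` are hypothetical, the scores are made up, and the handling of ties and of a remainder when the cohort size is not divisible by three is one plausible choice rather than the authors' documented procedure.

```python
def tertile_sizes(n_patients):
    """Split n patients into three near-equal group sizes (high, middle, low)."""
    base, remainder = divmod(n_patients, 3)
    return [base + (1 if k < remainder else 0) for k in range(3)]


def stratify_by_risk(scores):
    """Return a risk-group label per patient, aligned with the input order.

    Patients are ranked by risk score from high to low; the top third is
    labeled "high", the next third "middle", and the rest "low".
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    labels = [None] * len(scores)
    start = 0
    for name, size in zip(("high", "middle", "low"), tertile_sizes(len(scores))):
        for patient in ranked[start:start + size]:
            labels[patient] = name
        start += size
    return labels


if __name__ == "__main__":
    example_scores = [0.9, 0.1, 0.5, 0.7, 0.3, 0.2]  # made-up risk scores
    print(stratify_by_risk(example_scores))
    print(tertile_sizes(197_717))  # SEER cohort size reported in the study
```

For the 197,717 SEER patients this split yields groups of 65,906, 65,906, and 65,905, consistent with the "about 65,906" patients per group reported above.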

| Statistical analysis
The Mann-Whitney U test was used to compare continuous variables, whereas the chi-squared test was used for categorical variables. In the survival analysis, the Kaplan-Meier method and log-rank test were used to determine the prognostic differences between risk groups. CRC patients who did and did not receive each therapy (surgery, chemotherapy, or radiotherapy) were matched 1:1 using propensity score matching (PSM). To evaluate the performance of the different models, sensitivity, specificity, the Gini coefficient, the area under the ROC curve (AUC), and 95% confidence intervals (CIs) were computed from the numbers of correctly identified true-positive instances and incorrectly classified false-positive instances. All analyses were performed using R version 3.6.1.
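The performance metrics named above can be made concrete with a short sketch. It is an illustration in Python, not the R code used in the study: the function names and toy data are hypothetical, the AUC is computed with the pairwise (Mann-Whitney) formulation, and the Gini coefficient is assumed to follow the common convention Gini = 2 × AUC − 1, which the text does not state explicitly.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Equals the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case (ties count as 0.5).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


def gini(labels, scores):
    """Gini coefficient under the assumed convention Gini = 2 * AUC - 1."""
    return 2 * auc(labels, scores) - 1


def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    y = [1, 1, 0, 0, 1, 0]                  # toy outcome: 1 = DM, 0 = no DM
    p = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]     # toy predicted risks
    print(auc(y, p), gini(y, p), sensitivity_specificity(y, p))
```

The pairwise AUC is quadratic in the number of patients, so it is suitable only for illustration; rank-based implementations are preferable for cohorts of the size analyzed here.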

| Clinicopathological characteristics
In total, data from 197,717 CRC patients from the SEER database and 3241 patients from the NCC database were used in this study (Figure 1). In the SEER database, 20,043 (11.3%) patients developed DM from CRC, including metastases to the bone, brain, liver, and lungs. Overall, 88,181 (45.0%) CRC patients were >70 years old; 95,286 (48.2%) had right colon cancer, 53,055 (26.8%) had left colon cancer, and 49,403 (25.0%) had rectal cancer. A total of 139,825 (70.7%) patients with CRC were at the T3/T4 stage and 85,142 (43.1%) were at the N1/N2 stage (Table S1). Additionally, patients from the SEER database were randomly separated into a training set (n = 158,174) and a test set (n = 39,543) in an 8:2 ratio (Figure 1). Table 1 summarizes the clinicopathological characteristics of the patients enrolled in this study.
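As a quick arithmetic illustration, the sketch below reproduces the 8:2 split sizes and a few of the percentages reported above from the raw counts. It is written in Python; the helper names `split_sizes` and `percent` are hypothetical, and rounding the training-set size to the nearest integer is an assumption about how the split was taken.

```python
SEER_TOTAL = 197_717  # CRC patients from SEER, as reported in the text


def split_sizes(n_patients, train_fraction=0.8):
    """Return (training, test) sizes for a random split at the given fraction."""
    n_train = round(n_patients * train_fraction)
    return n_train, n_patients - n_train


def percent(count, total=SEER_TOTAL):
    """Share of the SEER cohort, in percent, rounded to one decimal place."""
    return round(100 * count / total, 1)


if __name__ == "__main__":
    print(split_sizes(SEER_TOTAL))
    for name, count in [
        ("right colon", 95_286),
        ("left colon", 53_055),
        ("rectum", 49_403),
        ("T3/T4", 139_825),
        ("N1/N2", 85_142),
    ]:
        print(name, percent(count))
```

The three tumor locations sum to 197,744, slightly more than the cohort size, so the location categories as reported may not be strictly exclusive or may include a small counting discrepancy in the source.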

| Model performance
To create reliable and accurate predictive models, recursive feature elimination and 10-fold CV were employed to select features iteratively based on the implementation of the RF classifier in the training set. To predict DM in CRC patients, five machine learning models based on the 12 aforementioned features were developed using the RF, LR, XGboost, DNN, and KNN algorithms. To evaluate the performance of the models, the AUC with p value and 95% confidence interval, specificity, sensitivity, and Gini score were calculated (Tables S2-S5). Compared with the KNN (AUC = 0.820, 0.723, and 0.759), DNN (AUC = 0.776, 0.774, and 0.724), XGboost (AUC = 0.802, 0.788, and 0.785), and LR (AUC = 0.797, 0.794, and 0.785) models on the training, test, and external validation sets, respectively, the RF model achieved the highest AUC on the training set (0.843) and the external validation set (0.806), and its AUC on the test set (0.793) was within 0.001 of the highest value (LR, 0.794). The DeLong test was performed to compare the ROC curves of these risk models and showed that the RF model substantially improved the ROC for predicting DM compared with the other risk models (Figure 2, Tables S3-S5). Considering the specificity, sensitivity, AUC, and Gini scores of all models together, the RF model exhibited the best performance (Table S2). The sensitivities and specificities of the prediction models were identical. We assessed the importance of each feature by comparing its gain value for DM prediction (Tables S6-S10). Although the models differed slightly in their feature importance rankings, the N stage and T stage were the most important risk factors in all models. In the RF model, N stage, T stage, tumor size, and harvested lymph nodes were the most significant risk factors for predicting the likelihood of developing DM (Figure 3).

| Risk stratification for patients
We used an RF classifier to calculate a risk score for each CRC patient to predict their likelihood of developing DM. After being ranked by risk score from high to low, the CRC patients were divided into three risk groups, with approximately 65,906 (33.3%) in each group (Figure 4A-F, Table S11). Those with the highest risk scores were assigned to the high-risk group, whereas those with the lowest risk scores were assigned to the low-risk group. The remaining patients were assigned to the middle-risk group. In the high-risk group, 14,553 (22.1%) patients developed overall metastases, compared with 4426 (6.7%) patients in the middle-risk group and 1064 (1.6%) patients in the low-risk group. The risk classification also held for the proportion of patients with single CRC metastases to the liver, lung, brain, and bone: patients in the high-risk group had the highest rates of DM and patients in the low-risk group had the lowest. Among patients who developed multiorgan metastases, the high-risk group likewise accounted for the largest proportion. Furthermore, we compared the 5-year overall survival (OS) rates among the three groups (Figure 5A-C). The survival analysis demonstrated distinct differences in survival probability among them: the CRC patients with the highest risk scores had the lowest OS rate, while patients in the low-risk group had the highest OS rate.
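The incidence figures above follow directly from the reported counts, as the short Python sketch below illustrates. The group size of 65,906 is the approximate value given in the text, and the variable and function names are hypothetical.

```python
GROUP_SIZE = 65_906  # approximate number of patients per risk group (from the text)
DM_CASES = {"high": 14_553, "middle": 4_426, "low": 1_064}  # overall metastases


def incidence_percent(cases, group_size=GROUP_SIZE):
    """DM incidence within a risk group, in percent, rounded to one decimal place."""
    return round(100 * cases / group_size, 1)


rates = {group: incidence_percent(cases) for group, cases in DM_CASES.items()}
total_dm = sum(DM_CASES.values())
high_to_low_ratio = DM_CASES["high"] / DM_CASES["low"]

if __name__ == "__main__":
    print(rates)
    print(total_dm)
    print(round(high_to_low_ratio, 1))
```

The three group counts sum to 20,043, which matches the number of SEER patients with DM reported in the clinicopathological characteristics, and the high-risk group contains roughly 14 times as many DM cases as the low-risk group.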

| Treatment benefits for three risk groups
Using propensity score matching based on age and year of CRC diagnosis, race, sex, T stage, N stage, tumor size, and histology, we balanced the clinicopathological features of patients with and without treatment to assess the benefit of treatment (surgery, chemotherapy, and radiotherapy) for patients with CRC in the three risk groups. We then examined the OS of patients with balanced baseline characteristics in the three risk groups (Figure 6A-I, Figure S1). The results revealed that surgery, chemotherapy, and radiotherapy significantly improved OS for patients with CRC in the high- and middle-risk groups. However, patients in the low-risk group benefited only from surgery and chemotherapy; radiotherapy did not improve their OS.
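The 1:1 matching step can be sketched as follows. The text does not specify the matching algorithm, so this Python example assumes greedy nearest-neighbor matching without replacement and a caliper of 0.05 on the propensity score; the function name `greedy_match`, the patient identifiers, and the scores are all hypothetical. The propensity scores themselves (for example, from a logistic regression of treatment on the covariates listed above) are taken as given.

```python
def greedy_match(treated, control, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on precomputed propensity scores.

    treated, control: dicts mapping a patient id to a propensity score.
    Each control is used at most once; a treated patient stays unmatched
    when the nearest remaining control is farther away than the caliper.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(control)
    pairs = []
    for t_id, t_score in sorted(treated.items(), key=lambda item: item[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs


if __name__ == "__main__":
    treated = {"t1": 0.30, "t2": 0.62, "t3": 0.95}   # made-up scores
    control = {"c1": 0.28, "c2": 0.60, "c3": 0.33, "c4": 0.10}
    print(greedy_match(treated, control))
```

In this toy example "t3" remains unmatched because no control lies within the caliper, which shows how matching can shrink the analyzed sample while improving covariate balance.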

| DISCUSSION
CRC is a relatively aggressive disease: approximately 20% of CRC patients have DM at the initial diagnosis, which can lead to treatment failure and poor prognoses. Hence, timely diagnosis and treatment of metastases from CRC are expected to improve clinical outcomes. As previously mentioned, the liver and lung are the most common metastatic sites. Therefore, liver and lung imaging examinations are routinely performed to detect hepatic and pulmonary metastases during the primary diagnosis of CRC. In addition, although rare, bone and brain metastases are increasing in incidence owing to the development of multidisciplinary treatment and the prolonged survival of patients with CRC. Moreover, a scoring system that comprised several factors, including rectal cancer, poor differentiation, CEA positivity, and metastases to lymph nodes and ectosteal organs, was constructed to predict the likelihood of bone metastases for patients newly diagnosed with CRC [27]. Additionally, another scoring system was reported that predicted the risk of developing bone metastases in CRC patients who received radical resection; it divided patients into different risk groups according to three independent risk factors, namely, pulmonary metastases, lymph node metastases, and rectal cancer [28]. Han et al. found that tumor location, grade, T/N stage, and other factors were correlated with the occurrence of bone metastases and then developed a nomogram based on these factors to predict bone metastases [29]. In the case of brain metastases, studies on risk factors remain limited. Michl et al. conducted the study with the largest number of CRC patients with brain metastases and revealed that brain metastases were the last event in the disease and that tumor location was correlated with brain metastases. Moreover, the primary tumors of patients with brain metastases were predominantly left-sided, particularly in the rectum [30]. Other studies also demonstrated that brain metastases occurred more commonly in patients with primary left-sided tumors [31].

FIGURE 3 The feature importance of the random forest model for predicting distant metastases.
Most patients with brain metastases from CRC also had concomitant metastases in the lungs, liver, and bone. Among these extracranial metastases, the lung was the most common site [32-35]. In addition, Lu et al. used a deep learning model based on MRI data to accurately assess pelvic lymph node metastasis in patients with rectal cancer, outperforming radiologists in both diagnostic quality and speed [36]. However, the application of radiomics is based on the hypothesis that a large amount of information can be extracted from images by radiomics, translating macroscopic image-based features into pathologic information. In radiomics studies, a large number of unexplainable radiomics features are extracted from images, most of which lack clinical significance. Most existing predictive models developed using complex and diverse radiomics features are extremely difficult to reproduce and popularize in clinical practice [37]. Furthermore, the wide range of imaging protocols, scanner types, and diagnostic criteria for tumor metastases affects accuracy, leading to highly heterogeneous results.
FIGURE 4 Risk levels for predicting distant metastasis of CRC using RF. The risk scores of developing overall distant (A), liver (B), lung (C), brain (D), bone (E), and multiple-organ (F) metastasis based on RF. Sorted by risk score from high to low, patients with CRC were divided into three risk groups of the same size: high-risk, middle-risk, and low-risk groups. The distant metastasis rates were significantly higher in the high-risk group than in the other groups (***p value < 0.001). CRC, colorectal cancer; RF, random forest.
In the present study, we used clinicopathological data from the NCC and SEER database to establish the first risk model using RF to predict the probability of DM (including liver, lung, bone, and brain metastases) in patients newly diagnosed with CRC. We found that N stage, T stage, tumor size, harvested lymph nodes, age at diagnosis, and tumor site played vital roles in this clinicopathological characteristic-based risk model, which partially accorded with the aforementioned studies [27-29]. We developed risk models using logistic regression and different machine learning algorithms, namely RF, XGboost, DNN, and KNN. The RF model exhibited the highest AUC in the training and external validation cohorts and an AUC within 0.001 of the best-performing model in the test cohort, suggesting that the RF-based prediction model had better overall discrimination than the other machine learning algorithms and logistic regression and could accurately identify patients who probably need further costly examinations (such as PET-CT) to detect potential distant metastases [27-29]. Remarkably, the RF-based risk model in this study can evaluate the likelihood of multiple DM from CRC, including liver, lung, bone, and brain metastases, offering physicians a more convenient predictive tool to identify patients with a higher risk of DM during primary diagnosis. Furthermore, according to the DM risk score obtained from our risk model, we can divide patients newly diagnosed with CRC into high-, middle-, and low-risk groups. Patients in the high-risk group had a much greater possibility of developing DM at the primary CRC diagnosis, alerting physicians to promptly detect metastases through intensive screening modalities, such as PET-CT and neurological imaging, even though these are not recommended by clinical guidelines. For patients in the middle- and low-risk groups, routine follow-up might be appropriate. Therefore, with this predictive clinical tool, physicians can evaluate the risk and improve surveillance management for patients according to their DM risk score. Moreover, this study is the first to use DM-risk stratification to analyze the efficacy of different treatment regimens for patients with CRC by comparing the survival outcomes of patients receiving different treatments in the three risk groups. Surgery, chemotherapy, and radiotherapy were beneficial for patients in the high- and middle-risk groups. Consistent with the results of our study, previous studies found that surgery or surgery plus radiotherapy was related to better survival outcomes in patients with CRC with brain metastases [38]. For patients in the low-risk group, surgery and chemotherapy could improve prognosis, while radiotherapy was not correlated with better survival outcomes, suggesting that radiotherapy might not be an ideal treatment option for patients in the low-risk group and could cause unnecessary radiation-induced injuries.
To date, this is the first study to build a risk model using multiple machine learning algorithms to predict the likelihood of developing multi-organ metastasis in newly diagnosed CRC patients. The large sample size (197,717 patients from SEER and 3241 patients from the NCC) was a distinctive advantage of this population-based study. Moreover, we used the external NCC cohort to validate the prediction model. This RF-based risk model exhibited good accuracy for predicting DM in all cohorts, which can be regarded as another strength. Furthermore, in view of clinical practicability, the risk factors in this risk model, including T/N stage, tumor size, tumor site, and age at diagnosis, are accessible clinicopathological characteristics that can be easily obtained in routine clinical practice, thus helping physicians predict the risk of metastases and adopt more individualized examination and surveillance strategies. Our prediction models could assess the benefits of surgery, chemotherapy, and radiotherapy for CRC patients belonging to different risk score groups, which could assist clinicians in selecting the optimal treatment.
However, the following shortcomings of our study must be acknowledged. First, the retrospective nature of this study should be noted, and the absence of some clinicopathological factors inevitably resulted in bias. Additionally, because the patients involved in this study represent only the American and Chinese populations, further studies recruiting patients from more countries are needed. Besides, the lack of genetic information in the SEER database, such as RAS/BRAF/MSI mutation status, which could improve the accuracy and use of the risk model, is another disadvantage of this study. Family history of cancer might also be a potential confounding factor; thus, for the external validation cohort from the NCC, we enrolled patients without a family history of cancer to avoid bias. However, there was no detailed information about family history of cancer for patients from the SEER database, which was a shortcoming of our study. Last but not least, the SEER database lacked a detailed record of patient-specific chemotherapy regimens. The application of chemotherapy with different regimens and courses might influence both prognosis and DM; thus, the influence of different chemotherapy regimens on DM risk and survival outcomes should be further validated. Meanwhile, the time to surgery and adjuvant therapy can have an impact on DM, but this information was also lacking in the SEER database, necessitating further analysis. As for patients in the external validation set from the NCC, the chemotherapy regimens included XELOX (oxaliplatin and capecitabine) and FOLFOX (5-fluorouracil, calcium folinate, and oxaliplatin). There was no delay in the time to surgery and adjuvant therapy for patients recruited in the external set. Despite these limitations, this risk model and stratification remain a useful clinical tool for predicting the possibility of DM at the initial diagnosis of CRC and providing guidance for clinical decision-making.
FIGURE 6 The OS comparison between CRC patients who received surgery and those who did not undergo surgery in the high-risk group (A), middle-risk group (B), and low-risk group (C). The OS comparison between CRC patients who received chemotherapy and those who did not receive chemotherapy in the high-risk group (D), middle-risk group (E), and low-risk group (F). The OS comparison between CRC patients who received radiotherapy and those who did not receive radiotherapy in the high-risk group (G), middle-risk group (H), and low-risk group (I). CRC, colorectal cancer; OS, overall survival.

| CONCLUSION
In summary, using clinicopathological data from the NCC and SEER databases, we established a novel RF-based risk model to predict the possibility of metastases to multiple organs, including the liver, lungs, bone, and brain, in patients with newly diagnosed CRC. According to the risk of DM, this prediction model could stratify patients into high-, middle-, and low-risk groups, which could assist physicians in identifying patients at high risk of developing DM and thus optimize therapeutic management.

| 3 of 14 WEI et al.
... lungs, bones, and brain. The diagnosis of cancer was based on topographic or histological classification, following the International Classification of Diseases for Oncology-3 (ICD-O-3)/World Health Organization 2008 guidelines. The American Joint Committee on Cancer (AJCC) 6th edition and SEER combined stage (2016+) TNM staging were used in our study. The exclusion criteria were as follows: (1) unknown metastatic status at initial diagnosis, (2) unknown pathohistological diagnosis, (3) age < 20 years, (4) diagnosis of benign or borderline tumors, and (5) lack of complete data on treatment and clinicopathological characteristics. Demographic and tumor characteristics were collected, and this process is shown in Figure 1.

2.4 | Development of risk models and risk stratification
CRC patients drawn from the SEER database were randomly separated into training and test sets in an 8:2 ratio, and those from the NCC formed the external validation set. The risk models for predicting the likelihood of developing DM were built in the training set and were subsequently assessed and verified in the test and external validation sets.
FIGURE 1 Flow diagram of the study population. In this study, a total of 197,717 CRC patients from the SEER database were included and divided into independent training and test sets in a ratio of 8:2, and 3241 patients from the NCC were included in the external validation set. The predictive models and risk stratification were established to help provide reliable individual information for CRC treatment recommendations. CRC, colorectal cancer; DNN, deep neural network; KNN, K-Nearest Neighbor; LR, logistic regression; NCC, National Cancer Center; RF, random forest; SEER, Surveillance, Epidemiology, and End Results; XGboost, extreme gradient boosting.
... the onset of neurological symptoms. Knowledge about the early detection of DM (especially bone and brain metastases) remains insufficient, and potential predictive risk factors are poorly understood. Our study demonstrated that N stage, T stage, tumor size and site, and the number of harvested lymph nodes were important risk factors for DM (including metastases to bone and brain). Previous studies have reported risk factors favoring bone metastases and risk prediction models, which were partially in line with the results of our study. Sun et al. revealed that the tumor site and lymph node invasion contributed to bone metastases in CRC patients after curative resection.

FIGURE 5 The survival comparison among high-risk, middle-risk, and low-risk groups. Overall survival was significantly worse in the high-risk group than in the middle-risk and low-risk groups in the training set (A), test set (B), and external validation set (C).
TABLE 1 Demographic and tumor characteristics of patients with colorectal cancer in SEER and NCC databases.
Column headers: Characteristic; Training set from SEER; Test set from SEER; External validation set from NCC; Nonmetastasis (n = 142,047); Metastasis (n = 16,127); p value.
Note: p values were calculated using the chi-squared test for categorical variables. Abbreviations: CRC, colorectal cancer; NCC, National Cancer Center; SEER, Surveillance, Epidemiology, and End Results.

... the RF classifier in the training set. Furthermore, to predict the likelihood of developing DM in newly diagnosed CRC patients before receiving treatment, and to help physicians select the optimal treatment for these patients, the features associated with the treatment (including surgery and chemoradiotherapy) were excluded from the prediction models. Twelve other features (age at CRC diagnosis, race, sex, year at diagnosis, tumor histology, tumor location and size, T-stage, N-stage, tumor grade, harvested lymph nodes, and primary tumor) were evaluated during the development of the machine-learning-based models.
TABLE 2 Univariable and multivariable logistic regression analyses for patients with metastatic CRC in SEER database.
Note: Logistic regression analysis was used to calculate the hazard ratio (HR) and 95% confidence interval (CI) based on metastatic CRC. Covariables that were significant in univariable logistic regression analysis (p < 0.05) are included in the multivariable analysis. Abbreviations: CI, confidence interval; CRC, colorectal cancer; HR, hazard ratio; SEER, Surveillance, Epidemiology, and End Results.

[24,25] Because of their low incidence, relevant studies are lacking, and therapeutic standards for bone and brain metastases have not yet been established. In clinical practice, the diagnosis of bone metastases from CRC relies on the manifestation of acute complications of bone destruction. Similarly, brain imaging is not conventionally performed to detect brain metastases until the onset of neurological symptoms.